Adaptive Huber Regression
Big data can easily be contaminated by outliers or contain variables with
heavy-tailed distributions, which makes many conventional methods inadequate.
To address this challenge, we propose the adaptive Huber regression for robust
estimation and inference. The key observation is that the robustification
parameter should adapt to the sample size, dimension and moments for optimal
tradeoff between bias and robustness. Our theoretical framework deals with
heavy-tailed distributions with bounded (1+δ)-th moment for any δ > 0. We
establish a sharp phase transition for robust estimation of regression
parameters in both low and high dimensions: when δ ≥ 1, the estimator
admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on
the data, while only a slower rate is available in the regime 0 < δ < 1.
Furthermore, this transition is smooth and optimal. In addition, we extend the
methodology to allow both heavy-tailed predictors and observation noise.
Simulation studies lend further support to the theory. In a genetic study of
cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown
to be more robust and predictive.
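The core idea above, a Huber-type estimator whose robustification parameter grows with the sample size to trade bias against robustness, can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the tau heuristic (spread of y times sqrt(n / log n)) and the plain gradient-descent solver are assumptions for the sketch.

```python
import numpy as np

def huber_grad(r, tau):
    """Gradient of the Huber loss with respect to the residuals r:
    identity inside [-tau, tau], clipped to +/- tau outside."""
    return np.clip(r, -tau, tau)

def adaptive_huber_regression(X, y, n_iter=500, lr=0.5):
    """Fit linear regression by minimizing the Huber loss with a
    data-driven robustification parameter (illustrative heuristic)."""
    n, d = X.shape
    # tau adapts to the sample size: small tau means more bias but more
    # robustness to heavy tails; large tau approaches least squares.
    tau = np.std(y) * np.sqrt(n / np.log(n))
    beta = np.zeros(d)
    for _ in range(n_iter):
        r = y - X @ beta
        # gradient descent step on the average Huber loss
        beta += lr * X.T @ huber_grad(r, tau) / n
    return beta
```

Because the loss gradient is clipped at tau, a single extreme observation can shift the estimate by at most lr * tau / n per step, which is what buys robustness under heavy-tailed noise.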
Distant Supervision for Entity Linking
Entity linking is an indispensable operation for populating knowledge
repositories in information extraction. It is the task of aligning a textual
entity mention to its corresponding disambiguated entry in a knowledge
repository. In this paper, we propose a new paradigm named distantly supervised
entity linking (DSEL), in the sense that the disambiguated entities that belong
to a huge knowledge repository (Freebase) are automatically aligned to the
corresponding descriptive webpages (Wiki pages). In this way, a large scale of
weakly labeled data can be generated without manual annotation and fed to a
classifier for linking newly discovered entities. Compared with traditional
paradigms based on a single knowledge base, DSEL benefits from jointly
leveraging the respective advantages of Freebase and Wikipedia.
Specifically, the proposed paradigm facilitates bridging the disambiguated
labels (Freebase) of entities and their textual descriptions (Wikipedia) for
Web-scale entities. Experiments conducted on a dataset of 140,000 items and
60,000 features achieve a baseline F1-measure of 0.517. Furthermore, we analyze
the feature performance and improve the F1-measure to 0.545.
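The weak-label generation step described above can be sketched as a simple alignment between knowledge-base entries and their descriptive pages. The entity ids, page texts, and the exact-title-match heuristic below are illustrative assumptions, not actual Freebase/Wikipedia records or the paper's alignment rule.

```python
def build_weak_labels(kb_entities, pages):
    """Pair each KB entity with the text of its descriptive page to form
    weakly labeled (text, entity_id) training examples.

    kb_entities: dict mapping entity id -> canonical name
    pages:       dict mapping page title -> page text
    Alignment uses exact title match (an assumed heuristic), so no
    manual annotation is required.
    """
    data = []
    for ent_id, name in kb_entities.items():
        if name in pages:  # automatic alignment, no human in the loop
            data.append((pages[name], ent_id))
    return data

# Hypothetical toy records standing in for Freebase ids and Wiki pages.
kb = {"/m/ent1": "Barack Obama", "/m/ent2": "Paris"}
pages = {
    "Barack Obama": "Barack Obama served as the 44th president of the US.",
    "Paris": "Paris is the capital of France.",
}
training_pairs = build_weak_labels(kb, pages)
```

Each resulting pair couples a disambiguated label (the KB id) with a textual description (the page), which is exactly the joint signal the abstract credits for DSEL's advantage over a single knowledge base.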
- …